CMSC 476/676 - Information Retrieval
Homework 2 - work individually
Assigned Tuesday, February 27, 2024
Due 10pm Monday, March 11, 2024

The objective of this assignment is to calculate the term weights for the tokens that occur in each document in your collection. To execute your program, it should be sufficient to type:

    calcwts input-directory output-directory

where input-directory contains the tokenized files you created as output in Homework 1. The calcwts program may be written in the language of your choice. It is permissible to create at most one set of temporary files. For example, you may tokenize and count frequencies within documents using Python and output that information to one temporary file per input file.
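
As an illustration only, here is a minimal sketch in Python (not required code; the per-file processing is elided) of how the command line above might be handled:

    import sys
    from pathlib import Path

    def main():
        # Expect exactly two arguments: the tokenized input directory and
        # the directory where the weighted output files will be written.
        if len(sys.argv) != 3:
            sys.exit("usage: calcwts input-directory output-directory")
        in_dir, out_dir = Path(sys.argv[1]), Path(sys.argv[2])
        out_dir.mkdir(parents=True, exist_ok=True)
        for infile in sorted(in_dir.iterdir()):
            ...  # count frequencies, weight terms, write one output file

    if __name__ == "__main__":
        main()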

Preprocessing You must extend the preprocessing as follows: remove stopwords, remove words that occur only once in the entire corpus, and remove words of length 1. Use this list of stopwords.
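
One way to organize these filters, sketched in Python under the assumption of a plain-text stopword file with one word per line (the file name is illustrative), is to drop stopwords and length-1 tokens as each document is read, and to defer the corpus-singleton filter until global counts are known:

    from collections import Counter

    def load_stopwords(path="stopwords.txt"):  # hypothetical file name
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    def keep_token(token, stopwords):
        # Drop stopwords and words of length 1; the occurs-only-once
        # filter needs corpus-wide counts, so it is applied afterwards.
        return len(token) > 1 and token not in stopwords

    def drop_singletons(doc_counts):
        # doc_counts maps each document id to a Counter of term frequencies.
        corpus_freq = Counter()
        for counts in doc_counts.values():
            corpus_freq.update(counts)
        return {
            doc: Counter({t: f for t, f in counts.items() if corpus_freq[t] > 1})
            for doc, counts in doc_counts.items()
        }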

Term Weighting You may use any of the tf * idf variants discussed in class and/or the textbook to weight your terms. Be sure to clearly describe the formula you chose to use in your report. You must normalize the term weights by the length of the documents. The easiest normalization is to use freq(word_i in document_j) / totalfreq(all tokens in document_j), but you may choose to do the proper square-root-of-sum-of-squares (cosine) vector normalization or any other normalization that takes some measure of document length into account.
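
For instance, one common variant (a sketch only; any variant covered in class or the textbook is acceptable) combines the length-normalized tf above with a base-2 logarithmic idf:

    import math

    def tfidf(tf, doc_len, df, n_docs):
        # tf: frequency of the term in this document
        # doc_len: total number of tokens in the document (length normalization)
        # df: number of documents in which the term occurs
        # n_docs: total number of documents in the collection
        return (tf / doc_len) * math.log2(n_docs / df)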

Output Files The goal is to build one output file per input file. In the output file, there should be one line per token that survives preprocessing. That line should contain the token and its weight.
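
For example, the first few lines of one such output file might look like this (tokens and weights invented purely for illustration):

    retrieval 0.031842
    index 0.017505
    query 0.009214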

Algorithm You may implement either a memory-based or a file-based algorithm, i.e., you may create a hash table in memory for each document to store the information about term frequencies, or a temporary file on disk to store the term frequencies (deprecated). You will store the global information (the number of documents in which each term appears) in memory. You should time your term weight calculation program on a varying number of documents (10, 20, 40, 80, 100, 200, 300, 400, 500) and plot, in your report, a graph of the indexing time as a function of the number of documents. The timings should be done on the FULL process (starting with the raw documents).
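
A memory-based version can be structured as a two-pass computation, sketched below in Python (preprocessing is omitted for brevity, and the input files are assumed to hold whitespace-separated tokens):

    import math
    from collections import Counter
    from pathlib import Path

    def calc_weights(in_dir, out_dir):
        doc_counts, df = {}, Counter()
        # Pass 1: per-document term frequencies plus global document frequencies.
        for infile in sorted(Path(in_dir).iterdir()):
            counts = Counter(infile.read_text().split())
            doc_counts[infile.stem] = counts
            df.update(counts.keys())  # each term counted once per document
        n_docs = len(doc_counts)
        # Pass 2: compute normalized tf*idf and write one .wts file per input.
        for doc, counts in doc_counts.items():
            doc_len = sum(counts.values())
            with open(Path(out_dir) / f"{doc}.wts", "w") as out:
                for term, tf in sorted(counts.items()):
                    weight = (tf / doc_len) * math.log2(n_docs / df[term])
                    out.write(f"{term} {weight:.6f}\n")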

Program Testing
Use the same set of files as in Homework 1.

Program Documentation
After your internal documentation (comments) is complete, write a report that provides a 2-3 page summary of your program. In particular, discuss how you extended/improved the preprocessing from Homework 1, if you made any changes. Show two examples of input documents and the resulting (surviving) tokens and their weights. Discuss the efficiency of your approach in terms of your timings on a Linux machine (use the time command) as a function of the number of documents indexed. The entire document should be four pages maximum. I will primarily be grading from the report, so make sure it clearly describes what you did and what your program's output and efficiency are.

Hand In
Using Blackboard as usual, submit your code (including any shell scripts), your report in PDF format, and, using 025.html as input, the resulting 025.wts, i.e., the term-weighted document.

Late Policy
10% deduction per 24 hours. Assignments turned in during or after the class will incur a 10% deduction.

Grading Rubric
Program runs to completion (50 points)
Performance analysis (table or graph, for different numbers of input documents) (15 points)
Report, including a discussion of the term weighting scheme used (35 points; some details TBD)

Extra credit for implementing BM25, using the suggested values for K1 and S1, or explaining your own choices
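
As a rough guide, a BM25 term weight could be sketched in Python as follows, using the conventional parameter names k1 and b (presumably what K1 and S1 refer to above) with the commonly cited defaults k1 = 1.2 and b = 0.75:

    import math

    def bm25_weight(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
        # Okapi BM25 weight for one term in one document; the +1 inside the
        # log is a common variant that keeps the idf component non-negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1) / (tf + length_norm)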

Extra extra extra credit for implementing in Haskell :-)